Starbucks Capstone Challenge

Project Overview

In this project, I look at how Starbucks customers use the app and how well the current offer system works. I also look at which customers the app should target with promotions. The data sets used in this project contain simulated data that mimics customer behavior on the Starbucks rewards mobile app. From them, we can understand customers' behavior, which might help us make better decisions.

The Problem

The problem is that we don't want to send offers to every customer. We want to send them only to customers we think will complete them. Sending an offer to someone who will probably not complete it wastes time and resources that could go to someone who will. I will approach this problem by first cleaning the data, then doing some exploratory analysis to see who my most valuable customers are. After that, I will create a model to help predict, for future customers, which type of offer we should give them.

Analysis

Part I: Business understanding

My objective here is to find patterns that show when and where to give a specific offer to a specific customer. The main users of this kind of analysis are Starbucks employees and analysts. The plan is to pose questions and answer them with data visualization. The data, provided by Starbucks, is simulated and mimics customer behavior.

Part II: Data Exploration

In this project we were given 3 files. Before analyzing them, we have to explore the data and see what we have. We need to check whether it is clean and whether each column has a type that matches its contents. For example, if a column called price is stored as strings, we need to convert it to numbers; otherwise, summing the column will not return its total. The same goes for dates stored as strings.
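As a minimal sketch of that kind of type fix, assuming pandas and a made-up frame (the price and date columns here are purely illustrative and are not columns from the three Starbucks files):

```python
import pandas as pd

# Hypothetical toy data: both columns were read in as strings.
df = pd.DataFrame({
    "price": ["3.50", "4.25", "2.00"],
    "date": ["20170801", "20170915", "20180102"],
})

# Convert the strings to proper numeric and datetime types.
df["price"] = pd.to_numeric(df["price"])
df["date"] = pd.to_datetime(df["date"], format="%Y%m%d")

# Now arithmetic and date accessors behave as expected.
total = df["price"].sum()
years = df["date"].dt.year.tolist()
```

After the conversion, total is a number rather than a joined string, and the date column supports accessors such as .dt.year.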

Part III: Data understanding

The data we have is provided by Starbucks. Here is a quick breakdown of what the data looks like:

portfolio.json

profile.json

transcript.json

Part IV: Data preparation / Wrangling

In this part, I will go through the data, do some data wrangling, and fix issues with columns such as value in the transcript table and channels in the portfolio table, among others.

First, let's check the data types.

One issue with the portfolio dataframe is that the channels column holds a list of items. To fix this, I will do something similar to one-hot encoding: create a new column for each channel type and put 1 where the channel applies to that promotion and 0 where it doesn't.
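The idea can be sketched roughly as follows. The offer ids and channel names in this toy frame are made up for illustration, so treat them as assumptions rather than the real portfolio contents:

```python
import pandas as pd

# Toy stand-in for the portfolio dataframe (hypothetical ids and channels).
portfolio = pd.DataFrame({
    "id": ["offer_a", "offer_b", "offer_c"],
    "channels": [
        ["web", "email"],
        ["email", "mobile", "social"],
        ["web", "email", "mobile"],
    ],
})

# Collect every channel name that appears in any row.
all_channels = sorted({ch for row in portfolio["channels"] for ch in row})

# One binary column per channel: 1 if the offer uses it, 0 otherwise.
for ch in all_channels:
    portfolio[ch] = portfolio["channels"].apply(lambda row, ch=ch: int(ch in row))

# The original list column is no longer needed.
portfolio = portfolio.drop(columns="channels")
```

Deriving the channel names from the data, rather than hard-coding them, keeps the sketch valid if a new channel shows up later.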

The portfolio dataframe has no NaN values.

Next, in the profile dataframe, we can see some NaN values in gender and income. Let's count the NaNs in every column.

As we can see, there are NaN values in the gender and income columns. For gender I will fill NaNs with N/A, and for income I will fill NaNs with the mean.
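A small sketch of the counting and filling steps, using a made-up four-row profile frame (the values are illustrative only):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the profile dataframe (hypothetical values).
profile = pd.DataFrame({
    "gender": ["F", None, "M", None],
    "age": [55, 118, 40, 118],
    "income": [70000.0, np.nan, 50000.0, np.nan],
})

# Count the missing values in every column before changing anything.
nan_counts = profile.isna().sum()

# Fill missing gender with the label N/A and missing income with the column mean.
profile["gender"] = profile["gender"].fillna("N/A")
profile["income"] = profile["income"].fillna(profile["income"].mean())
```

The mean is computed over the non-missing incomes only, which is pandas' default behavior for mean().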

While working through the data, I noticed an age of 118, which seems implausible. I also noticed that every profile with age equal to 118 has no gender listed, so the value is probably either a data entry error or a default. I will keep 118 as it is for those profiles.
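One way to confirm that observation on the raw profile data (before any filling) could look like the sketch below; the rows are made up for illustration:

```python
import pandas as pd

# Toy stand-in for the raw profile dataframe (hypothetical values).
profile = pd.DataFrame({
    "gender": ["F", None, "M", None],
    "age": [55, 118, 40, 118],
})

# Select the rows with the suspicious age.
placeholder_rows = profile[profile["age"] == 118]

# True only if every one of those rows is also missing gender.
all_missing_gender = bool(placeholder_rows["gender"].isna().all())
```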

Lastly, let's check the transcript dataframe.

Checking for NaN values, we can see that there are none. We do have to clean the value column, since it holds a dictionary of offer id, amount, or reward.

As we can see above, there are both offer_id and offer id columns. After fixing the naming, we need to merge those columns together since they mean the same thing.

The value column is no longer needed, so we can drop it.
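The three steps above (expanding the dictionaries, merging the two offer id spellings, and dropping value) can be sketched like this. The rows are made up, and the exact dictionary keys are an assumption based on the description above:

```python
import pandas as pd

# Toy stand-in for the transcript dataframe (hypothetical rows).
transcript = pd.DataFrame({
    "event": ["offer received", "offer completed", "transaction"],
    "value": [
        {"offer id": "abc"},
        {"offer_id": "abc", "reward": 5},
        {"amount": 12.5},
    ],
})

# Expand each dictionary into its own columns; missing keys become NaN.
expanded = pd.DataFrame(transcript["value"].tolist(), index=transcript.index)

# "offer id" and "offer_id" mean the same thing, so merge them into one column.
expanded["offer_id"] = expanded["offer_id"].fillna(expanded["offer id"])
expanded = expanded.drop(columns="offer id")

# Replace the original value column with the expanded columns.
transcript = pd.concat([transcript.drop(columns="value"), expanded], axis=1)
```

Transaction rows end up with NaN in offer_id, which is expected since they carry an amount rather than an offer.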

Now, the data looks ready for some analysis and machine learning.

Part V: Analysis / Modeling

This part is divided into two sections, Analysis and Modeling. I will start with Analysis, where I explore the data and look for insights.

1. Analysis:

A. Univariate Exploration:

1- How many variables do I have in each dataframe? What are the types?
2- What are the most common values for each column in each dataframe?
3- What is the average income for Starbucks customers?
4- What is the average age for Starbucks customers?
5- What is the most common promotion?
6- What is the least common promotion?
7- Who are the most loyal customers (most transcripts)?
8- What are the most common events in our transcripts?

B. Multivariate Exploration:

1- What is the most common promotion for children, teens, adult and elderly customers?
2- From profiles, which get more income, males or females?
3- What is the gender distribution in the transcript dataframe?
4- Who takes a long time to achieve each promotion goal, by gender, age and income?
5- How many customers do we get each month (became_member_on)?
6- What is the average length of time between two transcripts for the same customer?
7- Which type of promotion does each gender like (offer_type)?
8- Of the offers each customer received, how many did they complete?

A. Univariate Exploration:

A1: How many variables do I have in each dataframe? What are the types?

We already looked at the shape, checked for NaNs, and viewed the type of each column. Let's look at summary statistics for the numerical and categorical data: counts, mean, min/max, etc.

From the box plot above, we can see that most ages in our profile dataframe fall between 40 and 80. We already noticed one outlier, which is 118. The median is around 58 years old.

What about income?

Our boxplot shows that the median is around 65k and most incomes fall between 50k and 78k.

A2: What are the most common values for each column in each dataframe?

For age, we have 85 different values, so to make the graph more readable I think we should divide ages into groups:
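A rough sketch of such grouping with pandas.cut is shown here. Only the adult range (21 to 64) comes from this write-up; the other bin edges and the sample ages are assumptions for illustration:

```python
import pandas as pd

# Hypothetical sample of ages.
ages = pd.Series([18, 25, 58, 64, 65, 118])

# Bin edges are right-inclusive: (0, 12], (12, 20], (20, 64], (64, 200].
# Only the adult range 21-64 is taken from the text; the rest is assumed.
bins = [0, 12, 20, 64, 200]
labels = ["child", "teen", "adult", "elderly"]

age_group = pd.cut(ages, bins=bins, labels=labels)

# Number of profiles per group, ready for a bar chart.
counts = age_group.value_counts()
```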

We can see that we have a lot of profiles in the adult age group, ages between 21 and 64. Let's look inside the adult age group and see how it is divided.

For the became_member_on column, because we have a lot of dates, I will group it by month.
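Grouping by month could be sketched as below. This assumes the dates are stored as integers in YYYYMMDD form, which is an assumption about the raw file, and the sample values are made up:

```python
import pandas as pd

# Toy stand-in for the profile dataframe (hypothetical dates, assumed YYYYMMDD integers).
profile = pd.DataFrame({"became_member_on": [20170812, 20170830, 20171105, 20180101]})

# Parse the integers into real dates.
joined = pd.to_datetime(profile["became_member_on"].astype(str), format="%Y%m%d")

# Count new members per calendar month, in chronological order.
per_month = joined.dt.to_period("M").value_counts().sort_index()
```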

We can see that we did a better job between August 2017 and January 2018, when we gained more than 800 new profiles each month.

For the transcript dataframe, let's see the common values we have.

We can see that most of the transcripts are transactions. Around 75% of the offers received were viewed, and nearly 50% of the viewed offers were completed.
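Rates like these can be derived from the event counts. The sketch below uses a made-up event column whose proportions were chosen to mirror the percentages above; the event labels themselves are assumptions:

```python
import pandas as pd

# Hypothetical event column (labels and counts are illustrative).
events = pd.Series(
    ["transaction"] * 10
    + ["offer received"] * 8
    + ["offer viewed"] * 6
    + ["offer completed"] * 3
)

counts = events.value_counts()

# Share of received offers that were viewed, and of viewed offers that were completed.
view_rate = counts["offer viewed"] / counts["offer received"]
completion_rate = counts["offer completed"] / counts["offer viewed"]
```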

A3: What is the average income for Starbucks customers?

A4: What is the average age for Starbucks customers?

A5: What are the most common promotions, and what are their types?

Here I will look only at completed promotions.

Now let's look at the most common types of offers. To find that, I need to get the offer type from the portfolio dataframe.

As we saw in the graph above, BOGO offers and discount offers are pretty close.

A6: What is the least common promotion?

A7: Who are the most loyal customers (most transcripts)?

For this one, I will check the offer completed and transaction event types.

A8: What are the most common events in our transcripts?

Transactions account for the most rows in the transcript dataframe, around 140k, almost half of the dataframe's total.

B. Multivariate Exploration:

B1: What is the most common promotion for children, teens, young adult, adult and elderly customers?

To find out, we need the customer's age in the transcript dataframe. I created a function to get it from the profile dataframe (it takes time to run). I will ignore the 'child' age group since we have no rows for it.
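As an alternative to a row-by-row lookup function, a single merge usually runs much faster. This is a swapped-in technique, not the function used in this project, and the key column names (person in transcript, id in profile) are assumptions:

```python
import pandas as pd

# Toy stand-ins for the two dataframes (hypothetical ids and key column names).
profile = pd.DataFrame({
    "id": ["c1", "c2"],
    "age_group": ["adult", "elderly"],
    "gender": ["F", "M"],
})
transcript = pd.DataFrame({
    "person": ["c1", "c2", "c1"],
    "event": ["offer received", "transaction", "offer completed"],
})

# Left join keeps every transcript row and attaches the customer's attributes.
merged = transcript.merge(
    profile[["id", "age_group", "gender"]],
    left_on="person",
    right_on="id",
    how="left",
).drop(columns="id")
```

The same join can carry over gender or income in one pass instead of running a separate lookup for each column.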

B2: From profiles, which get more income, males or females?

Here I ignored those who didn't tell their gender.

The graph above shows that the median income (the white dot) for females (around 70k) is higher than for males (around 60k). We can also see that incomes for females spread from 40k to 100k, while most males fall between 40k and 70k, close to the median.

B3: What is the gender distribution in the transcript dataframe?

We need to add gender here as well, so I will write a function to do so (it takes time to run).

From the two graphs above, we can see that males received more offers than females. Both genders seem to react to those offers similarly. Around half of 80% of offers received were viewed by both genders, but it seems that females complete those offers more often than males. The numbers are:

The numbers above show that males received 9% more offers than females and made 19% more transactions, which suggests that they buy more than females. Regarding offer types, males and females received the same amount of BOGO and discount offers.

B4: Who takes long time to achieve each promotion goal and from which gender, age, income?

They are pretty close, if not the same. Both males and females take about 15 days to complete an offer.

B5: What is the average length between two transcript for the same customer?

The mean time it takes a customer to complete an offer is around 16 days (390 hours).

B6: Which type of promotions each gender likes (offer_type)?

We can see that both genders like BOGO and discount offers, and they have the same reaction to informational offers: neither seems to be interested in them.

B7: Of the offers each customer received, how many did they complete?

Females completed 56% of the offers they received, about 13 percentage points more than males, but males made more transactions than females, 64% to 43%.
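A completion rate per gender can be computed from a cross-tabulation of gender and event. The rows below are made up and do not reproduce the percentages above; the event labels are assumptions:

```python
import pandas as pd

# Toy transcript with gender already attached (hypothetical rows).
transcript = pd.DataFrame({
    "gender": ["F"] * 6 + ["M"] * 5,
    "event": ["offer received"] * 4 + ["offer completed"] * 2
           + ["offer received"] * 4 + ["offer completed"] * 1,
})

# Rows are genders, columns are event types, cells are counts.
counts = pd.crosstab(transcript["gender"], transcript["event"])

# Completed offers as a share of received offers, per gender.
completion_rate = counts["offer completed"] / counts["offer received"]
```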

Note:

Since I made a lot of changes to the transcript dataframe and they take time to redo on every run, I'll save the dataframe to a CSV file and just load it whenever I come back to work on the project. I will also add one last column to the dataframe, income, which might help us in the model (it takes time to run).
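Saving and reloading the prepared dataframe might look like this minimal sketch. The file name and the toy columns are hypothetical, and a temporary directory is used only to keep the example self-contained:

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the cleaned transcript dataframe (hypothetical columns).
transcript = pd.DataFrame({"person": ["c1", "c2"], "time": [0, 6]})

# Hypothetical file name; any writable path works.
path = os.path.join(tempfile.mkdtemp(), "transcript_clean.csv")

# index=False avoids writing the row index as an extra column.
transcript.to_csv(path, index=False)

# On later runs, load the saved file instead of repeating the slow steps.
reloaded = pd.read_csv(path)
```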

Before working on the model, I will take care of the NaN values in gender, offer_type and offer_id. offer_id and offer_type are NaN because those records are transactions, not offers received, viewed or completed; that is also why both columns have the same number of NaNs. I will replace NaNs in both columns with NA. Similarly, for gender, some customers didn't state their gender, as we saw before in the profile dataframe, where we replaced it with NA.

2. Modeling:

In this part, I will try to build a model that can identify which kind of offer we should give a customer. First, let's take a quick look at our final dataframe before modeling.

Because my model will predict the offer_type, I will keep only those transcripts that have offer ids.

Now, we should split our dataframe into features and target. Our features here are:

And my target is offer_type. For my target, I will replace the text labels with numbers, where BOGO = 1, discount = 2, informational = 3.
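The label replacement can be sketched with a plain dictionary mapping. The lowercase spellings of the labels are an assumption about how they appear in the data:

```python
import pandas as pd

# Hypothetical sample of the target column.
offer_type = pd.Series(["bogo", "discount", "informational", "bogo"])

# Mapping taken from the text: BOGO = 1, discount = 2, informational = 3.
mapping = {"bogo": 1, "discount": 2, "informational": 3}

target = offer_type.map(mapping)
```

Any label missing from the mapping would become NaN, which makes unexpected values easy to spot.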

We should normalize the numerical values (time, amount, reward, income) because we will use them as features.
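The write-up does not say which normalization was used. One common option, shown here as an assumption, is scikit-learn's MinMaxScaler, which rescales each column to the range 0 to 1. The feature values are made up:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy feature frame with the four numerical columns (hypothetical values).
features = pd.DataFrame({
    "time": [0.0, 6.0, 12.0],
    "amount": [0.0, 5.0, 10.0],
    "reward": [0.0, 5.0, 10.0],
    "income": [30000.0, 60000.0, 90000.0],
})

numeric_cols = ["time", "amount", "reward", "income"]

# Rescale every numerical column to [0, 1] (assumed choice of scaler).
scaler = MinMaxScaler()
features[numeric_cols] = scaler.fit_transform(features[numeric_cols])
```

Fitting the scaler on the training rows only, then applying it to the test rows, keeps test-set statistics out of the training process.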

Creating training and testing sets
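A minimal sketch of the split with scikit-learn's train_test_split follows. The 25% test size, the fixed seed, the stratification, and the toy arrays are all assumptions, since the actual settings are not stated here:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical features and target: 20 rows, 2 features, 2 balanced classes.
X = np.arange(40).reshape(20, 2)
y = np.array([1, 2] * 10)

# Hold out 25% of the rows for testing; stratify keeps class proportions similar.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
```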

Metrics:

Since we have a simple classification problem, I will use accuracy to evaluate my models. We want to see how well each model performs by comparing the number of correct predictions to the total number of predictions.
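To make the metric concrete, here is a small sketch that computes accuracy by hand and with scikit-learn's accuracy_score. The label lists are made up:

```python
from sklearn.metrics import accuracy_score

# Hypothetical true and predicted labels (1 = BOGO, 2 = discount, 3 = informational).
y_true = [1, 2, 3, 1, 2]
y_pred = [1, 2, 1, 1, 3]

# Accuracy = correct predictions / total predictions.
correct = sum(t == p for t, p in zip(y_true, y_pred))
manual_accuracy = correct / len(y_true)

# The library function computes the same ratio.
library_accuracy = accuracy_score(y_true, y_pred)
```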

I'll try different models and pick the best of them.

1. Logistic Regression

2. K-Nearest Neighbors

3. Decision Tree

4. Support Vector Machine

5. Naive Bayes

6. Random Forest
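Comparing the six models could be sketched as a simple loop over scikit-learn estimators. The synthetic data from make_classification and the hyperparameters are assumptions, so the scores this produces are not the project's results:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class data standing in for the real features and target.
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Support Vector Machine": SVC(),
    "Naive Bayes": GaussianNB(),
    "Random Forest": RandomForestClassifier(random_state=0),
}

# Fit each model and record (training accuracy, testing accuracy).
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = (model.score(X_train, y_train), model.score(X_test, y_test))
```

Recording both training and testing accuracy side by side makes a gap between the two easy to see for each model.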

Models Results

The table above shows the accuracy scores obtained with different supervised learning models. As the table shows, 3 of the 6 models reached 100% accuracy on both the training and testing sets. To avoid overfitting as much as possible, I will choose the model with the middle accuracy score on the testing set, which is Logistic Regression at 79.73%. I know that this is a reasonably good score, but the other scores are higher (except GaussianNB). Logistic Regression suits this problem since we have a small number of discrete outcomes (BOGO = 1, discount = 2, informational = 3), and we have a good amount of data to work with.

Model Improvements

The result from my Logistic Regression is good enough: we got a fair score and avoided overfitting. But let's try to improve it a little bit.

We improved slightly, by 0.36%. I still think the model is good as it is and we don't need to push for better results.
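The write-up does not say how the model was tuned. One common approach, shown here purely as an assumption, is a grid search over the regularization strength C of LogisticRegression with cross-validation. The synthetic data and the candidate values are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic 3-class data standing in for the real training set.
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_classes=3, random_state=0
)

# Candidate regularization strengths (assumed values).
param_grid = {"C": [0.01, 0.1, 1, 10]}

# 3-fold cross-validated search, scored by accuracy.
search = GridSearchCV(
    LogisticRegression(max_iter=1000), param_grid, cv=3, scoring="accuracy"
)
search.fit(X, y)

best_c = search.best_params_["C"]
best_cv_accuracy = search.best_score_
```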

Although I believe in the saying "There is always room for improvement", I think that the KNeighborsClassifier model is giving me a really good score. Trying to improve such a model further risks overfitting, so I will not suggest any further improvement to it.

Conclusion

In this project, I tried to analyze the data and build a model to predict the best offer to give a Starbucks customer. First, I explored the data to see what I had to change before starting the analysis. Then I did some exploratory analysis on the cleaned data. From that analysis, I found that the most popular types of offers are Buy One Get One (BOGO) offers and discount offers. I dug deeper to see what types of customers we have and noticed that females tend to complete offers more than males, completing 56% of the offers they received, whereas males completed only 43.18% of theirs. But our current data shows that we gave males more offers, since they have more transactions than females: 72794 transactions against 49382 for females. In conclusion, the company should give more offers to females than males since they complete more of them, and it should focus on BOGO and discount offers since those are the ones that tend to make customers buy more.

Improvements

I think I got to a point where we have good results and understand the data very well. To make the results even better, I would try to improve the data collection and fix the issues with NaN values. I would also try to get more data, such as location and, for each transaction, when it was completed, at which branch and at what time of day. All of this data can help us know when and where to give our offers. Having more data is also always a good thing for improving the model's results.